Revisiting Automatic Transliteration Problem for Code-Mixed Romanized Indian Social Media Text
Authors
Abstract
Although automatic transliteration for Indian languages is a well-studied paradigm, available transliteration techniques fail in the Indian social media context due to phenomena such as wordplay, creative spelling, code-mixing, and phonetic romanized typing, all implying that transliteration for Indian social media text has to be revisited. The paper reports an initial study on automatic transliteration for a Facebook message corpus in mixed English-Bengali-Hindi, aimed at restoring Hindi and Bengali code-mixed words to Devanagari and Bengali script, respectively.
Similar resources
POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments
We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media content. We propose extensions to the existing approaches and also present a new feature set which addresses the transliteration problem inherent in social media. We achieve an 84% accuracy with the new feature set. We show that the context and joint modeling of language detection and POS tag layers do...
Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning
The identification of the language of text/speech input is the first step to be able to properly do any language-dependent natural language processing. The task is called Automatic Language Identification (ALI). As ALI has been a well-studied field since the early 1960s, various methods have been applied to many standard languages. The ALI standard methods require datasets for training and use character/wor...
Finding Romanized Arabic Dialect in Code-Mixed Tweets
Recent computational work on Arabic dialect identification has focused primarily on building and annotating corpora written in Arabic script. Arabic dialects however also appear written in Roman script, especially in social media. This paper describes our recent work developing tweet corpora and a token-level classifier that identifies a romanized Arabic dialect and distinguishes it from French...
POS Tagging of English-Hindi Code-Mixed Social Media Content
Code-mixing is frequently observed in user-generated content on social media, especially from multilingual users. The linguistic complexity of such content is compounded by the presence of spelling variations, transliteration and non-adherence to formal grammar. We describe our initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text collated from Facebook forums, an...
Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text
Language identification at the document level has been considered an almost solved problem in some application areas, but language detectors fail in the social media context due to phenomena such as utterance internal code-switching, lexical borrowings, and phonetic typing; all implying that language identification in social media has to be carried out at the word level. The paper reports a stu...